BUG fixing module first() and DateOffset #52487

DeaMariaLeon · 2023-04-06T13:51:41Z

[ X] closes BUG: DataFrame.first has unexpected behavior when passing a DateOffset #45908
[ X] Tests added and passed if fixing a bug or adding a new feature
[] All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Also fixes part of #51284 (I have not worked on last())

~~Edge cases were only fixed for offsets passed as string.~~
~~Trying to make it work for DateOffset would make it more complicated. Should this be fixed or should DateOffset be removed from the function (and docs)?~~
~~Will DataOffset be deprecated or fixed? it has bugs or the documentation is incorrect.~~

As commented by @MarcoGorelli:

Functions first() and last() assume that the Series or DataFrame have a sorted DatetimeIndex. Only last() has it specified in the documentation.
first() could be easily done as follows (in case it's deprecated):
df[df.index < df.index[0] + DateOffset(days= 15)]

I also tried to use first() with relativedelta but I couldn’t make it work.

Edit:

first() doesn’t work for DateOffset. With this PR DateOffset works, but it now seems that there are inconsistencies when given a string or MonthEnd.
I think that since the PRs to fix #29623, first() with a string doesn’t work as normal offset arithmetic.

MarcoGorelli

Thanks for working on this!

I'm not super-familiar with offsets but I left a couple of comments, will get back to this

MarcoGorelli · 2023-04-06T14:32:08Z

pandas/core/generic.py

-        offset = to_offset(offset)
-        if not isinstance(offset, Tick) and offset.is_on_offset(self.index[0]):
+        if type(offset) is offsets.DateOffset and offset.is_on_offset(self.index[0]):
+            end_date = self.index[0] - Timedelta(1, unit="d") + offset


I don't think we can hard-code Timedelta(1, unit='d'), e.g. it would break DateOffset(minutes=2):

(Pdb) p self.index[0] Timestamp('2018-04-09 00:00:00') (Pdb) p self.index[0] - Timedelta(1, unit='d') Timestamp('2018-04-08 00:00:00') (Pdb) p self.index[0] - Timedelta(1, unit='d') + offset Timestamp('2018-04-08 00:02:00')

Thank you for the review!
Yes, we should not hard-code like that, I did it as a draft. I wanted to show how different the behavior of DateOffset is from the "Tick" object.
The whole first() was not implemented for DateOffset, there were no tests either.
Then there are bugs on it (on DateOffset itself).
In the end, I wanted to know if patching first() was correct, if DateOffset will be deprecated or if it should be fixed.

Some of the issues of DateOffset:

DateOffset.is_on_offset is different than the offset done with a string.

DateOffset.name doesn't work

At some point adding an offset to it, was going backwards! (as in self.index[0] + offset = earlier date)

pandas/core/generic.py

MarcoGorelli · 2023-04-11T13:33:01Z

Functions first() and last() assume that the Series or DataFrame have a sorted DatetimeIndex. Only last() has it specified in the documentation.

do you want to start by making a docs PR to fix this? I think this would need doing regardless of whether DateOffset were supported

DeaMariaLeon · 2023-04-11T13:36:15Z

do you want to start by making a docs PR to fix this?

Yes

MarcoGorelli

Thinking about this further, would it work to just do

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index e16ef3580d..274ab41260 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -43,6 +43,7 @@ from pandas._libs.tslibs import (
     Timestamp,
     to_offset,
 )
+from pandas._libs.tslibs.offsets import DateOffset
 from pandas._typing import (
     AlignJoin,
     AnyArrayLike,
@@ -9073,6 +9074,13 @@ class NDFrame(PandasObject, indexing.IndexingMixin):
         if len(self.index) == 0:
             return self.copy(deep=False)
 
+        if isinstance(offset, DateOffset):
+            end_date = end = self.index[0] + offset
+            if end_date in self.index:
+                end = self.index.searchsorted(end_date, side="left")
+                return self.iloc[:end]
+            return self.loc[:end]
+
         offset = to_offset(offset)
         if not isinstance(offset, Tick) and offset.is_on_offset(self.index[0]):
             # GH#29623 if first value is end of period, remove offset with n = 1

? (I'm showing the diff of generic.py compared with upstream/main)

DeaMariaLeon · 2023-04-11T16:00:54Z

would it work to just do

No... look:

>>> x = pd.Series([1] * 10, index=pd.bdate_range("2010-03-31", periods=10))
>>> result = x.first(pd.DateOffset(months=1))
>>> result
2010-03-31    1
2010-04-01    1
2010-04-02    1
2010-04-05    1
2010-04-06    1
2010-04-07    1
2010-04-08    1
2010-04-09    1
2010-04-12    1
2010-04-13    1
Freq: B, dtype: int64
>>> result2 = x.first("1M")
>>> result2
2010-03-31    1
Freq: B, dtype: int64

It doesn't need to start with an "end of the month" to fail.
The reason is that if offset is a DateOffset:

offset.is_on_offset(self.index[0])
True

If offset is string:

offset.is_on_offset(self.index[0])
False

We need to substract the Timedelta(1, unit= ___) - we could take the unit from the DateOffset.kwds or transform DateOffset to string, to start with. But I'm not sure if that's a good solution.

MarcoGorelli · 2023-04-11T16:14:06Z

Isn't that fine though?

Because DateOffset(months=1) is different from MonthEnd(1) (and 'M', somewhat confusingly, means 'month end' - but there is a PR open to deprecate and rename that)

In [4]: pd.Timestamp.now() + pd.DateOffset(months=1)
Out[4]: Timestamp('2023-05-11 16:26:01.139005')

In [5]: pd.Timestamp.now() + pd.tseries.offsets.MonthEnd(1)
Out[5]: Timestamp('2023-04-30 16:26:14.195128')

I think it makes sense that

x.first(pd.DateOffset(months=1))

would take the rows within one calendar month of the first element, whereas x.first('1M') would only go until the first month-end

DeaMariaLeon · 2023-04-11T17:43:01Z

Oh, OK!

but there is a PR open to deprecate and rename that

Does that mean that both (string offset and DateOffset) will return the same answer eventually?

Should I add examples with DateOffset in the docs, and new tests maybe?

last() will need to be fixed too

I don't need to speak @ the meeting, yaaay!

MarcoGorelli · 2023-04-12T15:12:32Z

Does that mean that both (string offset and DateOffset) will return the same answer eventually?

Sorry I was unclear, what I meant was that 'M' will be renamed to 'ME' to make it clear that it means "month end" (rather than calendar month, like DateOffset(months=1))

Should I add examples with DateOffset in the docs, and new tests maybe?

Sounds good!

DeaMariaLeon · 2023-04-12T18:21:58Z

Sorry I was unclear

No reason to apologise. Thanks for explaining.

I will add examples in the docs on another PR if that's fine.

DeaMariaLeon · 2023-04-14T12:42:03Z

Would somebody very kind review this "annoying" PR @MarcoGorelli ? :)

MarcoGorelli

Nice one, thanks! We can do a follow-up for last as well after

pandas/tests/frame/methods/test_first_and_last.py

DeaMariaLeon · 2023-04-17T07:46:31Z

Nice one, thanks! We can do a follow-up for last as well after

My pleasure... Yes, we can do last() as well.

It's unclear to me why you only commented about using date_range on one test, but not the others. I mean, I use it on the other tests as well.
And when I removed date_range from line 122 (def test_first_with_date_offset) the assert_equal failed because freq=None on expected. So I removed it also from the input i. I’m not sure if that’s OK.

Before I removed freqat the input, I tried adding it to expected, but the methods I tried used an offset as well.

@MarcoGorelli : it looks green

MarcoGorelli

Nice work! Just left a minor comment on the whatsnew note

I'll leave open a bit in case others (@jbrockmendel ? @phofl ?) have comments, else can merge

It's unclear to me why you only commented about using date_range on one test, but not the others. I mean, I use it on the other tests as well.

Just to not repeat myself :)

MarcoGorelli · 2023-04-17T15:40:19Z

doc/source/whatsnew/v2.0.1.rst

@@ -33,6 +33,7 @@ Bug fixes
 - Bug in :meth:`ArrowDtype.__from_arrow__` not respecting if dtype is explicitly given (:issue:`52533`)
 - Bug in :meth:`DataFrame.max` and related casting different :class:`Timestamp` resolutions always to nanoseconds (:issue:`52524`)
 - Bug in :meth:`Series.describe` not returning :class:`ArrowDtype` with ``pyarrow.float64`` type with numeric data (:issue:`52427`)
+- Fixed bug in :func:`first` when used with a DateOffset (:issue:`45908`)


does this close both issues? if so:

Suggested change

- Fixed bug in :func:`first` when used with a DateOffset (:issue:`45908`)

- Fixed bug in :func:`first` when used with a :class:`DateOffset` (:issue:`45908`)

I had done this, but changed again later (per later request)

DeaMariaLeon · 2023-04-17T17:35:41Z

Thanks @MarcoGorelli for the review (and the patience).

jbrockmendel · 2023-04-18T15:37:04Z

doc/source/whatsnew/v2.0.1.rst

@@ -33,6 +33,7 @@ Bug fixes
 - Bug in :meth:`ArrowDtype.__from_arrow__` not respecting if dtype is explicitly given (:issue:`52533`)
 - Bug in :meth:`DataFrame.max` and related casting different :class:`Timestamp` resolutions always to nanoseconds (:issue:`52524`)
 - Bug in :meth:`Series.describe` not returning :class:`ArrowDtype` with ``pyarrow.float64`` type with numeric data (:issue:`52427`)
+- Fixed bug in :func:`first` when used with a :class:`DateOffset` (:issue:`45908`)


there is no pd.first. Maybe :meth:DataFrame.first?

Change done

jbrockmendel · 2023-04-18T15:38:18Z

pandas/core/generic.py

@@ -9070,7 +9071,15 @@ def first(self, offset) -> Self:
        if len(self.index) == 0:
            return self.copy(deep=False)

+        if isinstance(offset, offsets.DateOffset):


doesn't the offset = to_offset(offset) below make this check unnecessary?

the issue is that doing this check after to_offset would make it impossible to distinguish between MonthEnd, as that would also satisfy isinstance(offset, DateOffset)

they need to be handled differently because something like DateOffset(days=15) would always return True for if not isinstance(offset, Tick) and offset.is_on_offset(self.index[0]), and so would be shifted backward by offset.base (this is why in the linked issues, the result only has a single row for DateOffset input)

jbrockmendel · 2023-04-18T15:39:38Z

pandas/core/generic.py

+            if end_date in self.index:
+                end = self.index.searchsorted(end_date, side="left")
+                return self.iloc[:end]
+            return self.loc[:end]


a lot of this looks really similar to what we have below. can any of this be de-duplicated?

DeaMariaLeon · 2023-04-21T08:07:43Z

Friendly ping @MarcoGorelli @jbrockmendel

MarcoGorelli

Wait sorry, I gave a bad review here - this introduces an inconsistency between first('M') and first(MonthEnd()) (which should mean the same thing - reminder that 'M' means 'month end'):

In [1]: ser = pd.Series([1,2,3], index=pd.DatetimeIndex(['2020-01-31', '2020-02-01', '2020-03-01']))

In [2]: ser.first(MonthEnd())
Out[2]: 
2020-01-31    1
2020-02-01    2
dtype: int64

In [3]: ser.first('M')
Out[3]: 
2020-01-31    1
dtype: int64

@jbrockmendel is there a way to tell if an offset is DateOffset but not MonthEnd? I was a bit surprised to see

In [11]: isinstance(MonthEnd(), DateOffset)
Out[11]: True

. type(offset) is DateOffset would technically work, but I don't see it being done anywhere else

(as an aside, the fact that the tests still pass indicates the relatively poor test coverage that this function had to begin with)

MarcoGorelli · 2023-04-21T13:47:01Z

pandas/core/generic.py

            # GH#29623 if first value is end of period, remove offset with n = 1
            #  before adding the real offset
            end_date = end = self.index[0] - offset.base + offset
        else:
            end_date = end = self.index[0] + offset

        # Tick-like, e.g. 3 weeks
-        if isinstance(offset, Tick) and end_date in self.index:
+        if isinstance(offset, Tick) or input_is_offset and end_date in self.index:


I can never remember if a or b and c is evaluated as (a or b) and c or a or (b and c), could we add parantheses please?

I'll change that.

DeaMariaLeon · 2023-04-24T08:22:23Z

this introduces an inconsistency between first('M') and first(MonthEnd()) (which should mean the same thing - reminder that 'M' means 'month end'):

I think the original issue #29623 was not a bug but first() (with a string) was modified to match the docstring: “For instance, ‘1M’ will display all the rows having their index within the first month.”

If first accepts a DateOffset then it should always handle this offset in a way which is consistent with DateOffset arithmetic. If the first timestamp in the index is '2020-01-31' then a MonthEnd() offset means going to the following month end:

from pandas.tseries.offsets import MonthEnd, Day
from pandas.tseries.frequencies import to_offset
from pandas.tseries.offsets import DateOffset

t = pd.Timestamp('2020-01-31')
t + MonthEnd()
Timestamp('2020-02-29 00:00:00’)

Adding for example 15 days means going 15 days forward

t + Day(15)
Timestamp('2020-02-15 00:00:00’)

Here we go to mid February with a 15-day offset. In this example, the timespan covered by an offset of month end is thus longer than that of an offset of 15 days. It would be completely unexpected for the user if the time period returned by first would be the opposite (ie shorter with a MonthEnd offset than with a 15-day offset).

For example:

i = pd.date_range('2020/1/31', periods=31, freq="D")
s = pd.Series(range(len(i)), index=i)

print(s.first(MonthEnd()).tail(5))
2020-02-24    24
2020-02-25    25
2020-02-26    26
2020-02-27    27
2020-02-28    28
Freq: D, dtype: int64

print(s.first(DateOffset(months=1)).tail(5))
2020-02-24    24
2020-02-25    25
2020-02-26    26
2020-02-27    27
2020-02-28    28
Freq: D, dtype: int64

print(s.first(‘1M').tail(5))
2020-01-31    0
Freq: D, dtype: int64

print(s.first(‘15D').tail(5))
2020-02-10    10
2020-02-11    11
2020-02-12    12
2020-02-13    13
2020-02-14    14
Freq: D, dtype: int64

So I think the options should be the following :

have first() behave with a DateOffset input with normal offset arithmetic and behave with a string input as it does now (what we have on this PR). But explain the difference in the docstring.
allow only string inputs to avoid any confusion and any change in the current string input behaviour.
change the behaviour with a string input so that it behaves consistently with month end semantics. The docstring could be changed to that effect to :'1M' will display all the rows having their index within the first month end following the first element in the index

I'm not sure if I have to (friendly) ping you :) @MarcoGorelli

MarcoGorelli · 2023-04-24T11:30:16Z

I think the original issue #29623 was not a bug but first() (with a string) was modified to match the docstring:

I'm inclined to agree, and am pretty tempted to suggest reverting the PR which changed it

I'll add it to Wednesday's agenda

github-actions · 2023-05-27T00:04:48Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

DeaMariaLeon · 2023-05-27T05:28:33Z

I will work on this shortly

DeaMariaLeon · 2023-06-02T07:43:57Z

Closing this one because Series/DataFrame.first() has been deprecated: #53419

DeaMariaLeon added 4 commits April 5, 2023 12:56

BUG first

0ed276f

Merge remote-tracking branch 'upstream/main' into newbug

16028ed

BUG first

d76aa11

Merge remote-tracking branch 'upstream/main' into newbug

1fee537

DeaMariaLeon requested a review from MarcoGorelli April 6, 2023 13:53

MarcoGorelli reviewed Apr 6, 2023

View reviewed changes

MarcoGorelli reviewed Apr 11, 2023

View reviewed changes

mroeschke added the Frequency DateOffsets label Apr 12, 2023

DeaMariaLeon added 2 commits April 12, 2023 19:59

BUG fixing func first with DO

fbf2d10

Merge remote-tracking branch 'upstream/main' into newbug

9d769b4

DeaMariaLeon marked this pull request as ready for review April 13, 2023 11:51

Merge remote-tracking branch 'upstream/main' into newbug

bc4423a

MarcoGorelli suggested changes Apr 14, 2023

View reviewed changes

MarcoGorelli added this to the 2.1 milestone Apr 14, 2023

DeaMariaLeon added 2 commits April 17, 2023 09:32

Modified as suggestions

45e594f

Merge remote-tracking branch 'upstream/main' into newbug

643822c

MarcoGorelli approved these changes Apr 17, 2023

View reviewed changes

DeaMariaLeon added 3 commits April 17, 2023 18:53

Merge remote-tracking branch 'upstream/main' into newbug

5a6f4c7

Corrected whatsnew

0d80894

Merge remote-tracking branch 'upstream/main' into newbug

dfc981d

jbrockmendel reviewed Apr 18, 2023

View reviewed changes

DeaMariaLeon added 5 commits April 20, 2023 09:10

fixing first

621c028

added var

4d37542

removed parenthesis

ac12fdc

Removed duplicated code

27b6377

Merge remote-tracking branch 'upstream/main' into newbug

7bf43e6

MarcoGorelli suggested changes Apr 21, 2023

View reviewed changes

MarcoGorelli mentioned this pull request Apr 27, 2023

BUG: DataFrame.first has unexpected behavior when passing a DateOffset #45908

Closed

3 tasks

github-actions bot added the Stale label May 27, 2023

DeaMariaLeon closed this Jun 2, 2023

DeaMariaLeon deleted the newbug branch June 28, 2023 11:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG fixing module first() and DateOffset #52487

BUG fixing module first() and DateOffset #52487

DeaMariaLeon commented Apr 6, 2023 •

edited

Loading

MarcoGorelli left a comment

MarcoGorelli Apr 6, 2023

DeaMariaLeon Apr 6, 2023

MarcoGorelli commented Apr 11, 2023

DeaMariaLeon commented Apr 11, 2023

MarcoGorelli left a comment

DeaMariaLeon commented Apr 11, 2023

MarcoGorelli commented Apr 11, 2023 •

edited

Loading

DeaMariaLeon commented Apr 11, 2023

MarcoGorelli commented Apr 12, 2023

DeaMariaLeon commented Apr 12, 2023

DeaMariaLeon commented Apr 14, 2023

MarcoGorelli left a comment

DeaMariaLeon commented Apr 17, 2023 •

edited

Loading

MarcoGorelli left a comment •

edited

Loading

MarcoGorelli Apr 17, 2023

DeaMariaLeon Apr 20, 2023

DeaMariaLeon commented Apr 17, 2023

jbrockmendel Apr 18, 2023

DeaMariaLeon Apr 20, 2023

jbrockmendel Apr 18, 2023

MarcoGorelli Apr 18, 2023

MarcoGorelli Apr 18, 2023

jbrockmendel Apr 18, 2023

DeaMariaLeon Apr 20, 2023

DeaMariaLeon commented Apr 21, 2023

MarcoGorelli left a comment •

edited

Loading

MarcoGorelli Apr 21, 2023

DeaMariaLeon Apr 24, 2023

DeaMariaLeon commented Apr 24, 2023

MarcoGorelli commented Apr 24, 2023

github-actions bot commented May 27, 2023

DeaMariaLeon commented May 27, 2023

DeaMariaLeon commented Jun 2, 2023

	- Fixed bug in :func:`first` when used with a DateOffset (:issue:`45908`)
	- Fixed bug in :func:`first` when used with a :class:`DateOffset` (:issue:`45908`)

BUG fixing module first() and DateOffset #52487

BUG fixing module first() and DateOffset #52487

Conversation

DeaMariaLeon commented Apr 6, 2023 • edited Loading

MarcoGorelli left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MarcoGorelli commented Apr 11, 2023

DeaMariaLeon commented Apr 11, 2023

MarcoGorelli left a comment

Choose a reason for hiding this comment

DeaMariaLeon commented Apr 11, 2023

MarcoGorelli commented Apr 11, 2023 • edited Loading

DeaMariaLeon commented Apr 11, 2023

MarcoGorelli commented Apr 12, 2023

DeaMariaLeon commented Apr 12, 2023

DeaMariaLeon commented Apr 14, 2023

MarcoGorelli left a comment

Choose a reason for hiding this comment

DeaMariaLeon commented Apr 17, 2023 • edited Loading

MarcoGorelli left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DeaMariaLeon commented Apr 17, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DeaMariaLeon commented Apr 21, 2023

MarcoGorelli left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DeaMariaLeon commented Apr 24, 2023

MarcoGorelli commented Apr 24, 2023

github-actions bot commented May 27, 2023

DeaMariaLeon commented May 27, 2023

DeaMariaLeon commented Jun 2, 2023

DeaMariaLeon commented Apr 6, 2023 •

edited

Loading

MarcoGorelli commented Apr 11, 2023 •

edited

Loading

DeaMariaLeon commented Apr 17, 2023 •

edited

Loading

MarcoGorelli left a comment •

edited

Loading

MarcoGorelli left a comment •

edited

Loading